Abuzar Akhtar
B.Tech CSE


Basic Statistics


Measures of Central Tendency

Measures of central tendency are statistical measures used to describe the center or typical value of a dataset. They provide a single value around which the data tends to cluster. Closely related to them are moments, which generalize the idea of the mean to describe the spread and shape of a distribution, and the derived measures of skewness and kurtosis. Each of these provides different insights into the distribution of the data and helps in understanding its characteristics. Let's explore each of them:

Moments

Moments are numerical values that summarize the shape and distribution of a dataset. The term "moment" comes from mathematical physics and is used in statistics to characterize the distribution of data points around the mean. There are several types of moments, but the first four moments are the most commonly used:
  • First Moment (Mean): The first moment is the arithmetic mean of a dataset, denoted by $\mu$. It represents the center of mass or average value of the data points.

    Mathematically, the mean $\mu$ of a dataset with $n$ data points $\{x_1, x_2, \ldots, x_n\}$ is calculated as:

    $$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

    For a continuous random variable $X$ with probability density function $f(x)$, the first moment about the origin (the mean) is given by:

    $$\mu = \int_{-\infty}^{+\infty} x \cdot f(x) \, dx$$
  • Second Moment (Variance): The second central moment measures the spread or dispersion of the data points around the mean. It is denoted by $\sigma^2$ (sigma squared) and is the average of the squared differences between each data point and the mean.

    Mathematically, the variance $\sigma^2$ of a dataset with $n$ data points $\{x_1, x_2, \ldots, x_n\}$ and mean $\mu$ is calculated as:

    $$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

    For a continuous random variable $X$ with probability density function $f(x)$, the second central moment (the variance) is given by:

    $$\sigma^2 = \int_{-\infty}^{+\infty} (x - \mu)^2 \cdot f(x) \, dx$$
  • Third Moment (Skewness): Skewness measures the asymmetry of the data distribution. It indicates whether the data is skewed to the left (negative skewness) or to the right (positive skewness) relative to the mean. Skewness is denoted by $\gamma_1$ (gamma one) and is the third central moment standardized by the cube of the standard deviation.

    Mathematically, the skewness $\gamma_1$ of a dataset with $n$ data points $\{x_1, x_2, \ldots, x_n\}$ and mean $\mu$ is calculated as:

    $$\gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2\right)^{3/2}}$$

    For a continuous random variable $X$ with probability density function $f(x)$, the third central moment $\mu_3$ (from which skewness is obtained as $\gamma_1 = \mu_3 / \sigma^3$) is given by:

    $$\mu_3 = \int_{-\infty}^{+\infty} (x - \mu)^3 \cdot f(x) \, dx$$
  • Fourth Moment (Kurtosis): Kurtosis measures the shape of the distribution and assesses the presence of outliers or heavy tails. High kurtosis indicates a sharper peak and heavier tails, while low kurtosis indicates a flatter peak and lighter tails. Kurtosis is the fourth central moment standardized by the square of the variance; excess kurtosis, denoted by $\gamma_2$ (gamma two), subtracts 3 so that a normal distribution scores zero.

    Mathematically, the kurtosis of a dataset with $n$ data points $\{x_1, x_2, \ldots, x_n\}$ and mean $\mu$ is calculated as:

    $$\text{Kurt} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2\right)^{2}}, \qquad \gamma_2 = \text{Kurt} - 3$$

    For a continuous random variable $X$ with probability density function $f(x)$, the fourth central moment $\mu_4$ (from which kurtosis is obtained as $\mu_4 / \sigma^4$) is given by:

    $$\mu_4 = \int_{-\infty}^{+\infty} (x - \mu)^4 \cdot f(x) \, dx$$
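To tie these formulas together, here is a minimal Python sketch (assuming NumPy and SciPy are available; the dataset is invented for illustration) that computes the four sample moments directly from the definitions above and cross-checks the results against scipy.stats:

```python
import numpy as np
from scipy import stats

# Illustrative dataset (arbitrary values)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

mu = x.sum() / n                     # first moment: mean
var = ((x - mu) ** 2).sum() / n      # second central moment: variance
m3 = ((x - mu) ** 3).sum() / n       # third central moment
m4 = ((x - mu) ** 4).sum() / n       # fourth central moment

gamma1 = m3 / var ** 1.5             # skewness
gamma2 = m4 / var ** 2 - 3           # excess kurtosis

print(mu, var, gamma1, gamma2)

# Cross-check with SciPy: bias=True matches the 1/n definitions above,
# and fisher=True makes kurtosis() return excess kurtosis.
assert np.isclose(gamma1, stats.skew(x, bias=True))
assert np.isclose(gamma2, stats.kurtosis(x, fisher=True, bias=True))
```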

Skewness

Skewness is a measure of the asymmetry of a probability distribution. It quantifies the extent to which a dataset deviates from being symmetrical. The skewness value can be positive, negative, or zero:
  • Positive Skewness: A positive skewness value indicates that the data is skewed to the right, meaning the tail on the right side of the distribution is longer, and the majority of the data points are concentrated on the left side.
  • Negative Skewness: A negative skewness value indicates that the data is skewed to the left, meaning the tail on the left side of the distribution is longer, and the majority of the data points are concentrated on the right side.
  • Zero Skewness: A skewness value close to zero suggests that the data is approximately symmetrically distributed.

Kurtosis

Kurtosis measures the heaviness of the tails of a probability distribution compared to the tails of a normal distribution. It provides information about the presence of extreme values (outliers) and the sharpness of the distribution peak:
  • Leptokurtic: Positive excess kurtosis ($\gamma_2 > 0$) indicates a leptokurtic distribution, which has heavier tails and a sharper peak compared to a normal distribution.
  • Mesokurtic: A normal distribution has zero excess kurtosis ($\gamma_2 = 0$) and is referred to as mesokurtic. It means the tails are similar to those of a normal distribution.
  • Platykurtic: Negative excess kurtosis ($\gamma_2 < 0$) indicates a platykurtic distribution, which has lighter tails and a flatter peak compared to a normal distribution.
Understanding the moments, skewness, and kurtosis of a dataset helps in better interpreting the data's distribution and making informed decisions in various fields such as finance, economics, and data analysis.
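To make these categories concrete, the following sketch (again assuming NumPy and SciPy; the choice of distributions is illustrative) samples from distributions that are textbook examples of each shape and prints their sample skewness and excess kurtosis:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = {
    "normal (mesokurtic)": rng.normal(size=100_000),
    "Laplace (leptokurtic)": rng.laplace(size=100_000),
    "uniform (platykurtic)": rng.uniform(size=100_000),
    "exponential (right-skewed)": rng.exponential(size=100_000),
}

for name, s in samples.items():
    # kurtosis() returns excess kurtosis by default (fisher=True)
    print(f"{name:28s} skew={stats.skew(s):+.2f}  "
          f"excess kurtosis={stats.kurtosis(s):+.2f}")
```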

Correlation

Correlation is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. It helps us understand how changes in one variable are associated with changes in another variable. Correlation is essential for determining the degree to which two variables move together or in opposite directions.

Pearson Correlation Coefficient

The most common method of measuring correlation is the Pearson correlation coefficient ($r$). It ranges from $-1$ to $1$, where:
  • $r = 1$ indicates a perfect positive correlation (both variables increase together linearly).
  • $r = -1$ indicates a perfect negative correlation (as one variable increases, the other decreases linearly).
  • $r = 0$ indicates no linear correlation (the variables are not linearly related).
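As a brief illustration (assuming NumPy and SciPy; the data are invented), Pearson's $r$ can be computed directly from its definition as the covariance divided by the product of the standard deviations, or with a library helper:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly linear in x

# Definition: covariance over the product of standard deviations
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

r_scipy, p_value = stats.pearsonr(x, y)  # also returns a two-sided p-value
print(r_manual, r_scipy)                 # both close to 1
```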

Spearman and Kendall Correlation

In cases where the relationship between variables is monotonic rather than linear, Spearman and Kendall correlation coefficients are used. Spearman correlation calculates the correlation between the ranks of the data points, while Kendall correlation measures the similarity of the ordering of the data points.

Interpreting Correlation

It's essential to remember that correlation does not imply causation. A high correlation between two variables does not necessarily mean one variable causes the other to change. Causality requires additional analysis and experiments.

Regression

Regression is a statistical technique used to model the relationship between a dependent variable (response) and one or more independent variables (predictors). It enables us to make predictions and understand how the dependent variable changes as the independent variables vary.

Linear Regression

Linear regression is the most common form of regression. It aims to fit a straight line (linear equation) to the data points that best describes the relationship between the dependent and independent variables. The equation of a simple linear regression model is of the form:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where:
  • $y$ is the dependent variable,
  • $x$ is the independent variable,
  • $\beta_0$ is the intercept (the value of $y$ when $x$ is 0),
  • $\beta_1$ is the slope (the change in $y$ for a unit change in $x$),
  • $\varepsilon$ represents the error term (residuals, the differences between predicted and actual values).
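A minimal sketch of fitting this model by ordinary least squares, assuming NumPy and synthetic data with known coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=x.size)  # true beta0=2, beta1=0.5

# Closed-form least-squares estimates for simple linear regression
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

residuals = y - (beta0 + beta1 * x)
print(f"beta0={beta0:.2f}, beta1={beta1:.2f}")  # close to 2.0 and 0.5
```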

Multiple Regression

In cases with more than one independent variable, multiple regression is used. It extends the linear regression model to accommodate multiple predictors. The equation becomes:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \varepsilon$$
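In matrix form the same least-squares idea carries over directly. A sketch assuming NumPy and two invented predictors:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(size=n)
x2 = rng.uniform(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [1.0, 2.0, -3.0]
```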

Interpreting Regression

Regression analysis helps us understand the relationship between variables, predict outcomes, and identify which independent variables are significant predictors of the dependent variable. It is widely used in various fields, including economics, finance, social sciences, and machine learning.

Assumptions

It is crucial to check the assumptions of regression, such as linearity, independence of errors, constant variance of residuals (homoscedasticity), and normality of residuals, to ensure the validity and reliability of the regression model.
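As one illustrative (and far from exhaustive) check, normality of the residuals can be examined with a Shapiro-Wilk test; the sketch below assumes SciPy and uses synthetic residuals as a stand-in for those of a fitted model:

```python
import numpy as np
from scipy import stats

# Hypothetical residuals standing in for those of a fitted regression model
residuals = np.random.default_rng(3).normal(scale=0.5, size=50)

stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p:.3f}")  # large p: no evidence against normality
```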

Important Note

Correlation and regression are powerful tools, but they should be used carefully and complemented with other analyses to draw meaningful conclusions and make informed decisions based on the data.

Rank Correlation

Rank correlation is a statistical measure used to assess the strength and direction of the association between two variables when the relationship is not linear. It is particularly useful when the data is ordinal, meaning that the variables are ranked or placed in categories with a natural order but no specific numeric value.
There are two commonly used methods for calculating rank correlation:

Spearman Rank Correlation Coefficient (Spearman's $\rho$)

Spearman's rank correlation coefficient is based on the ranks of the data points rather than their actual values. It measures the extent to which the ranks of the data pairs move together or in opposite directions.
Suppose we have two variables, $X$ and $Y$, with corresponding ranks $R_X$ and $R_Y$, and $n$ data points. The formula for calculating Spearman's rank correlation coefficient ($\rho$) is given by:
$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the ranks of the data pair ($R_X - R_Y$) for the $i$-th data point.
The Spearman rank correlation coefficient ranges from $-1$ to $1$, where:
  • $\rho = 1$ indicates a perfect positive rank correlation (the ranks of both variables are exactly the same).
  • $\rho = -1$ indicates a perfect negative rank correlation (the ranks of one variable are the reverse of the ranks of the other variable).
  • $\rho = 0$ indicates no rank correlation (the ranks are independent of each other).
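A short sketch of this formula (assuming SciPy and tie-free, invented data; when ties are present the rank-difference formula needs a correction, so the library routine is the safer choice):

```python
import numpy as np
from scipy import stats

x = np.array([10, 20, 30, 40, 50])
y = np.array([1, 3, 2, 5, 4])
n = len(x)

# Ranks of each variable (1 = smallest); rankdata averages ties
rx = stats.rankdata(x)
ry = stats.rankdata(y)

d = rx - ry
rho_manual = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho_scipy, p = stats.spearmanr(x, y)
print(rho_manual, rho_scipy)  # identical here because there are no ties
```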

Kendall Rank Correlation Coefficient (Kendall's $\tau$)

Kendall's tau is another rank correlation measure that quantifies the similarity of the ordering of data points between two variables. It compares the number of concordant and discordant pairs of data.
Suppose we have two variables, $X$ and $Y$, with $n$ data points. The formula for calculating Kendall's tau ($\tau$) is given by:
$$\tau = \frac{\text{Number of concordant pairs} - \text{Number of discordant pairs}}{n(n-1)/2}$$
The value of Kendall's tau ranges from $-1$ to $1$, where:
  • $\tau = 1$ indicates a perfect positive rank correlation (all data pairs are concordant).
  • $\tau = -1$ indicates a perfect negative rank correlation (all data pairs are discordant).
  • $\tau = 0$ indicates no rank correlation (the numbers of concordant and discordant pairs are equal).
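A sketch of this pair-counting definition (assuming SciPy and tie-free, invented data; scipy.stats.kendalltau additionally handles ties):

```python
import numpy as np
from itertools import combinations
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])
n = len(x)

concordant = discordant = 0
for i, j in combinations(range(n), 2):
    # A pair is concordant if both variables order points i and j the same way
    s = np.sign(x[j] - x[i]) * np.sign(y[j] - y[i])
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

tau_manual = (concordant - discordant) / (n * (n - 1) / 2)
tau_scipy, p = stats.kendalltau(x, y)
print(tau_manual, tau_scipy)  # agree when there are no ties
```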
Rank correlation is particularly useful when dealing with data that cannot be measured on a continuous scale and may not follow a linear relationship. It provides a non-parametric measure of association, making it robust to outliers and resistant to certain types of data transformations. Rank correlation is widely used in fields such as sociology, psychology, and market research, where the data is often ranked or categorized.